Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: deadlock issue #9728

Merged
merged 7 commits into from
Jul 29, 2024
Merged

fix: deadlock issue #9728

merged 7 commits into from
Jul 29, 2024

Conversation

amandavialva01
Copy link
Contributor

@amandavialva01 amandavialva01 commented Jul 25, 2024

Ticket

DET-10441

Description

Fix deadlocking issue with creating or verifying existence of a kubernetes namespace

Test Plan

Checklist

  • Changes have been manually QA'd
  • New features have been approved by the corresponding PM
  • User-facing API changes have the "User-facing API Change" label
  • Release notes have been added as a separate file under docs/release-notes/
    See Release Note for details.
  • Licenses have been included for new code which was copied and/or modified from any external code

Copy link

netlify bot commented Jul 25, 2024

Deploy Preview for determined-ui canceled.

Name Link
🔨 Latest commit 4a2edd9
🔍 Latest deploy log https://app.netlify.com/sites/determined-ui/deploys/66a791b2cbb4d90008479066

Copy link

codecov bot commented Jul 25, 2024

Codecov Report

Attention: Patch coverage is 70.00000% with 15 lines in your changes missing coverage. Please review.

Project coverage is 53.44%. Comparing base (202ab62) to head (4a2edd9).
Report is 10 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #9728      +/-   ##
==========================================
- Coverage   53.44%   53.44%   -0.01%     
==========================================
  Files        1257     1257              
  Lines      154350   154359       +9     
  Branches     3298     3298              
==========================================
  Hits        82500    82500              
- Misses      71700    71709       +9     
  Partials      150      150              
Flag Coverage Δ
backend 44.87% <70.00%> (-0.01%) ⬇️
harness 72.69% <ø> (ø)
web 51.92% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Coverage Δ
master/internal/rm/kubernetesrm/jobs.go 71.40% <70.00%> (+0.11%) ⬆️

... and 3 files with indirect coverage changes

@carolinaecalderon carolinaecalderon added the to-cherry-pick Pull requests that need to be cherry-picked into the current release label Jul 25, 2024
@davidfluck-hpe
Copy link
Contributor

Were you able to reproduce the issue?

@amandavialva01
Copy link
Contributor Author

amandavialva01 commented Jul 25, 2024

Were you able to reproduce the issue?

This PR seems to resolve the issue, which I have tested by helm installing Determined with this commit SHA onto a remote cluster and running the workflow that does produce the deadlock when a SHA from main is used with the helm install instead.
Working on end-to-end tests that I'll add to this PR tomorrow

@davidfluck-hpe
Copy link
Contributor

Fantastic, thank you! What is the workflow that reproduces the issue?

@amandavialva01
Copy link
Contributor Author

amandavialva01 commented Jul 26, 2024

Fantastic, thank you! What is the workflow that reproduces the issue?

If you're running a remote GKE cluster: Create a workspace bound to an auto-created namespace and start a task (I used a JL instance) in that workspace. Then, delete the master deployment pod and wait for the k8s API server to recreate it. Once the pod is back up and running, open the "Edit Workspace" modal in the webUI and save the workspace as is. You'll see the loading sign, and then nothing will load after that 🙈😂

I image that a CLI command such as det w bindings set workspace <workspace_name> --auto-create-namespace would cause the same issue as opening the "Edit" modal and saving the workspace on the UI, because when master restarts, we are no longer correctly tracking the namespaces that we have informers for, so any API call that invokes createNamespace causes the creation of duplicate informers for the same namespace, resulting in a deadlock.

Copy link
Contributor

@kkunapuli kkunapuli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

I just have a question about when we set j.namespacesWithInformers[namespace] = true (line 260).

@amandavialva01 amandavialva01 requested a review from a team as a code owner July 26, 2024 16:11
@amandavialva01 amandavialva01 requested a review from a team as a code owner July 26, 2024 20:03
@amandavialva01 amandavialva01 requested a review from a team as a code owner July 26, 2024 22:07
@amandavialva01 amandavialva01 force-pushed the amanda/fixDeadlockIssue branch 2 times, most recently from a6ff0b5 to 7e0cdb0 Compare July 26, 2024 22:32
@amandavialva01 amandavialva01 removed the request for review from jgongd July 27, 2024 01:39
@amandavialva01 amandavialva01 force-pushed the amanda/fixDeadlockIssue branch 3 times, most recently from a325786 to a90a6a1 Compare July 29, 2024 12:12
@amandavialva01 amandavialva01 force-pushed the amanda/fixDeadlockIssue branch 2 times, most recently from 1bebba5 to ef3c3d1 Compare July 29, 2024 12:44
Copy link
Contributor

@kkunapuli kkunapuli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Nice work.

# function that only gets called when a job is running in the namespace that we want to bind to the
# workspace. Verify that we don't run into this deadlock when triggering multiple calls to the
# API handler that binds a workspace to an auto-created namespace, as this request can trigger the
# deadlock if the namespace (or verify the existence of for the first time) is running a job.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I love this explanation! Great comment and great test.

@@ -35,9 +35,3 @@ stages:
kubeconfig_path: ~/.kube/config
determined_master_ip: $DOCKER_LOCALHOST
determined_master_port: 8080
internal_task_gateway:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Out of curiosity, why is this deleted? Was it a bit of cleanup discovered during testing?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeaa, I think we only need to use a gateway for mutliRM to make tasks that create resources in the remote cluster (the cluster tied to the additional RM) accessible to the determined master, so single RM clusters don't need that config!

@amandavialva01 amandavialva01 merged commit 9dc0afa into main Jul 29, 2024
77 of 95 checks passed
@amandavialva01 amandavialva01 deleted the amanda/fixDeadlockIssue branch July 29, 2024 14:15
github-actions bot pushed a commit that referenced this pull request Jul 29, 2024
(cherry picked from commit 9dc0afa)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla-signed to-cherry-pick Pull requests that need to be cherry-picked into the current release
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants